Accelerating Content-Defined Chunking for Data Deduplication Based on Speculative Jump
Abstract
In data deduplication systems, chunking has a significant impact on the deduplication ratio and throughput. Existing Content-Defined Chunking (CDC) approaches exploit a sliding window to calculate rolling hashes of the input stream byte-by-byte, then determine chunk cut-points if the hash satisfies a given cut-condition. Since previous CDC approaches are extremely costly, chunking often significantly degrades the throughput of deduplication systems. In this paper, we argue that calculating and checking rolling hashes byte-by-byte is unnecessary. To reduce the CPU overhead of CDC, we propose a jump-based chunking (JC) approach. The key idea is to introduce a jump-condition, so that the sliding window can jump over a specific length when the hash satisfies the jump-condition. Moreover, we also explore how the cut-condition and jump-condition relate to chunk size. Our theoretic studies demonstrate the effectiveness and efficiency of JC, without compromising the deduplication ratio. Experimental results show that JC improves chunking throughput by about 2× on average compared with state-of-the-art CDC approaches, while still guaranteeing a high deduplication ratio.
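The jump idea described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the gear-style rolling hash, the mask values, the `jump_len`, `min_size`, and `max_size` parameters, and the function name `jump_chunk` are all assumptions chosen for illustration. The sketch only shows the control flow: hash byte by byte, cut when the cut-condition holds, and skip ahead when only the jump-condition holds.

```python
import random

MASK64 = (1 << 64) - 1

# Hypothetical gear table: one fixed pseudo-random 64-bit value per byte value.
_rng = random.Random(0)
GEAR = [_rng.getrandbits(64) for _ in range(256)]

# Assumed masks: the jump mask is a subset of the cut mask, so every hash that
# satisfies the cut-condition also satisfies the jump-condition.
CUT_MASK = 0x1FFF << 16   # 13 bits must be zero to cut
JUMP_MASK = 0x0FFF << 16  # 12 bits must be zero to jump


def jump_chunk(data, min_size=64, max_size=4096, jump_len=128):
    """Return the list of chunk end offsets for `data` (illustrative only)."""
    cuts = []
    start = 0
    n = len(data)
    while start < n:
        end = min(start + max_size, n)   # forced cut at max_size or end of data
        i = min(start + min_size, n)     # never cut before min_size
        h = 0
        cut = end
        while i < end:
            h = ((h << 1) + GEAR[data[i]]) & MASK64  # gear-style rolling hash
            i += 1
            if h & JUMP_MASK == 0:
                if h & CUT_MASK == 0:
                    cut = i              # cut-condition met: chunk ends here
                    break
                i += jump_len            # only jump-condition met: skip ahead
                h = 0                    # window is discontinuous, restart hash
        cuts.append(cut)
        start = cut
    return cuts


if __name__ == "__main__":
    rng = random.Random(1)
    sample = bytes(rng.getrandbits(8) for _ in range(100_000))
    ends = jump_chunk(sample)
    print(len(ends), "chunks, last offset", ends[-1])
```

Because the skipped bytes are never hashed, the work per chunk shrinks as jumps become more frequent. How the two conditions and the jump length should be balanced to keep the chunk size and deduplication ratio stable is the subject of the paper and is not reproduced here.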
Similar resources
FastCDC: a Fast and Efficient Content-Defined Chunking Approach for Data Deduplication
Content-Defined Chunking (CDC) has been playing a key role in data deduplication systems in the past 15 years or so due to its high redundancy detection ability. However, existing CDC-based approaches introduce heavy CPU overhead because they declare the chunk cutpoints by computing and judging the rolling hashes of the data stream byte by byte. In this paper, we propose FastCDC, a Fast and eff...
Bimodal Content Defined Chunking for Backup Streams
Data deduplication has become a popular technology for reducing the amount of storage space necessary for backup and archival data. Content defined chunking (CDC) techniques are well established methods of separating a data stream into variable-size chunks such that duplicate content has a good chance of being discovered irrespective of its position in the data stream. Requirements for CDC incl...
Leap-based Content Defined Chunking - Theory and Implementation
Content Defined Chunking (CDC) is an important component in data deduplication, which affects both the deduplication ratio as well as deduplication performance. The sliding-window-based CDC algorithm and its variants have been the most popular CDC algorithms for the last 15 years. However, their performance is limited in certain application scenarios since they have to slide byte by byte. The a...
Two Stage Max Gain Content Defined Chunking for De-duplication
Data de-duplication is a very simple concept with very smart technology associated with it. The data blocks are stored only once; de-duplication systems decrease storage consumption by identifying distinct chunks of data with identical content. They then store a single copy of the chunk along with metadata about how to reconstruct the original files from the chunks; this takes up the less stora...
A Logistic Based Mathematical Model to Optimize Duplicate Elimination Ratio in Content Defined Chunking Based Big Data Storage System
Longxiang Wang 1, Xiaoshe Dong 1, Xingjun Zhang 1,*, Fuliang Guo 1, Yinfeng Wang 2 and Weifeng Gong 3; 1 The School of Electronic and Information Engineering, Xi’an Jiaotong University, Xi’an 710049, China; 2 The Shenzhen Institute of Information Technology, Shenzhen 518172, China...
Journal
Journal title: IEEE Transactions on Parallel and Distributed Systems
Year: 2023
ISSN: ['1045-9219', '1558-2183', '2161-9883']
DOI: https://doi.org/10.1109/tpds.2023.3290770